NOTE: if you are not sure why the following exercises are useful or relevant to the multiple regression session then bear with me to the end; it will become clearer!
Sequences and designs for sampling and allocation
One trick you may find useful later in the course is making sequences of numbers.
There are a few ways to do this, but the simplest is to write: 1:10. That is, the number to start from (1), a colon (:), and then the number to end with (10).
Copy and paste these examples to see the output:
1:10## [1] 1 2 3 4 5 6 7 8 9 10
20:30## [1] 20 21 22 23 24 25 26 27 28 29 30
Explanation: The output shows that R has created a sequence of whole numbers between the start and finish number.
To get a sequence with only even numbers, we can use the seq function, and set the by argument to 2:
seq(from=2, to=10, by=2)## [1] 2 4 6 8 10
You can set by to any number, including a decimal:
seq(0, 27, by=3)## [1] 0 3 6 9 12 15 18 21 24 27
seq(0, 1, by=0.2)## [1] 0.0 0.2 0.4 0.6 0.8 1.0
If your sequence doesn’t have a simple pattern, you can also write out the numbers by hand using the c() function E.g.:
c(1,40,92,188)## [1] 1 40 92 188
Explanation: c(...) is short for combine, so this command combines the numbers 1, 40, 92, 188 into a new sequence. This is sometimes called a vector in R-speak.
Make some sequences which include:
- Even numbers from 10 to 20
- Numbers in the 8 times table less than 200
- 20 evenly spaced numbers between zero and 1 (including zero and 1)
- The words “Wibble”, “Wobble” and “Bobble”
We can use seq for numbers:
seq(10,20,by=2)## [1] 10 12 14 16 18 20
seq(0,200, 8)## [1] 0 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128 136 144
## [20] 152 160 168 176 184 192 200
seq(0,1, by=1/19)## [1] 0.00000000 0.05263158 0.10526316 0.15789474 0.21052632 0.26315789
## [7] 0.31578947 0.36842105 0.42105263 0.47368421 0.52631579 0.57894737
## [13] 0.63157895 0.68421053 0.73684211 0.78947368 0.84210526 0.89473684
## [19] 0.94736842 1.00000000
But we need to use c() for lists of words:
c("Wibble", "Wobble", "Bobble")## [1] "Wibble" "Wobble" "Bobble"
Combinations of sequences
In designing experiments, or creating a grid of numbers for making predictions, we often want to create combinations of different categories which represent conditions or stimuli.
Imagine a hypothetical study with a test phase where participants are presented with multiple words, in either red or green text, and shown at either the bottom or top of the computer screen.
The combinations look something like this:
| condition | colour | position | word |
|---|---|---|---|
| 1 | Red | Top | Nobble |
| 2 | Green | Top | Nobble |
| 3 | Red | Bottom | Nobble |
| 4 | Green | Bottom | Nobble |
| 5 | Red | Top | Wobble |
| 6 | Green | Top | Wobble |
| 7 | Red | Bottom | Wobble |
| 8 | Green | Bottom | Wobble |
| 9 | Red | Top | Hobble |
| 10 | Green | Top | Hobble |
| 11 | Red | Bottom | Hobble |
| 12 | Green | Bottom | Hobble |
R provides quick ways of creating combinations of variables, using a command called expand.grid.
First, we need to create a sequence of each of the possible values for our categories:
colours = c("Red", "Green")
positions = c("Top", "Bottom")
words = c("Nobble", "Wobble", "Hobble")Then we can use expand.grid to give us all the possible combinations of these:
expand.grid(colour=colours, position=positions, words = words)## colour position words
## 1 Red Top Nobble
## 2 Green Top Nobble
## 3 Red Bottom Nobble
## 4 Green Bottom Nobble
## 5 Red Top Wobble
## 6 Green Top Wobble
## 7 Red Bottom Wobble
## 8 Green Bottom Wobble
## 9 Red Top Hobble
## 10 Green Top Hobble
## 11 Red Bottom Hobble
## 12 Green Bottom Hobble
Explanation: The expand.grid function has taken the items in the three input sequences (colours, positions and words) and created a dataframe which contains all the possible combinations. We could save these to a file if we wanted to use them as part of our experiment.
Create some experimental designs of your own
Reproduce the experiment design above by copying and pasting
Adapt the commands to allow for an experiment where the word position could be either top, bottom, left or right. How many different conditions would there be in this case?
As a stretch task (this might be sligtly harder): How would you create a design where a sequence of 3 words is presented. Each word must be different, but each 3-word combination can be presented in red or green text, and at the top or bottom of the screen?
expand.grid(colour=colours, position=positions, word1 = words, word2=words, word3=words) %>%
filter(
word1 != word2 & word2 != word3 & word1 != word3
) ## colour position word1 word2 word3
## 1 Red Top Hobble Wobble Nobble
## 2 Green Top Hobble Wobble Nobble
## 3 Red Bottom Hobble Wobble Nobble
## 4 Green Bottom Hobble Wobble Nobble
## 5 Red Top Wobble Hobble Nobble
## 6 Green Top Wobble Hobble Nobble
## 7 Red Bottom Wobble Hobble Nobble
## 8 Green Bottom Wobble Hobble Nobble
## 9 Red Top Hobble Nobble Wobble
## 10 Green Top Hobble Nobble Wobble
## 11 Red Bottom Hobble Nobble Wobble
## 12 Green Bottom Hobble Nobble Wobble
## 13 Red Top Nobble Hobble Wobble
## 14 Green Top Nobble Hobble Wobble
## 15 Red Bottom Nobble Hobble Wobble
## 16 Green Bottom Nobble Hobble Wobble
## 17 Red Top Wobble Nobble Hobble
## 18 Green Top Wobble Nobble Hobble
## 19 Red Bottom Wobble Nobble Hobble
## 20 Green Bottom Wobble Nobble Hobble
## 21 Red Top Nobble Wobble Hobble
## 22 Green Top Nobble Wobble Hobble
## 23 Red Bottom Nobble Wobble Hobble
## 24 Green Bottom Nobble Wobble Hobble
Explanation of the command: There may be a neater way of doing this, but here we simply:
- Create all combinations of words, colors, spatial locations and positions in the sequence (1,2 or 3)
- Filter our rows where two of the words are the same
Another way to do the same thing using the unique and length functions might be:
expand.grid(colour=colours, position=positions, word1 = words, word2=words, word3=words) %>%
rowwise() %>%
filter(length(unique(c(word1, word2, word3)))==3)## # A tibble: 24 x 5
## # Rowwise:
## colour position word1 word2 word3
## <fct> <fct> <fct> <fct> <fct>
## 1 Red Top Hobble Wobble Nobble
## 2 Green Top Hobble Wobble Nobble
## 3 Red Bottom Hobble Wobble Nobble
## 4 Green Bottom Hobble Wobble Nobble
## 5 Red Top Wobble Hobble Nobble
## 6 Green Top Wobble Hobble Nobble
## 7 Red Bottom Wobble Hobble Nobble
## 8 Green Bottom Wobble Hobble Nobble
## 9 Red Top Hobble Nobble Wobble
## 10 Green Top Hobble Nobble Wobble
## # … with 14 more rows
Explanation: Here we apply the sample filter, but using the length function to count the number of unique words among word 1, 2 and 3. The rowwise function is used to make R consider each row individually (it won’t work without it because R tries to consider the whole of a column at once).
Take the model from the main worksheet were we predicted grades from work hours for men and women.
In the main worksheet we created a dataframe by hand to tell augment what predictions we wanted.
Now try using expand.grid to make the new dataframe instead. For example, try making predictions for men and women who work 20, 25, 30, 35, or 40 hours per week. Make this dataframe using expand.grid and without using the c() function.
# setup and read data
library(tidyverse)
studyhabits <- read_csv('https://benwhalley.github.io/rmip/data/studyhabitsandgrades.csv')
# run the model
second.model <- lm(grade ~ work_hours * female, data = studyhabits)
# made the new data grid for predictions
prediction.grid <- expand.grid(work_hours=seq(20,40,5), female=c(TRUE, FALSE)) %>%
as_tibble
prediction.grid## # A tibble: 10 x 2
## work_hours female
## <dbl> <lgl>
## 1 20 TRUE
## 2 25 TRUE
## 3 30 TRUE
## 4 35 TRUE
## 5 40 TRUE
## 6 20 FALSE
## 7 25 FALSE
## 8 30 FALSE
## 9 35 FALSE
## 10 40 FALSE
# make the predictions
new.predictions <- broom::augment(second.model, newdata=prediction.grid)
new.predictions## # A tibble: 10 x 4
## work_hours female .fitted .se.fit
## <dbl> <lgl> <dbl> <dbl>
## 1 20 TRUE 55.5 0.821
## 2 25 TRUE 59.1 0.542
## 3 30 TRUE 62.7 0.740
## 4 35 TRUE 66.3 1.20
## 5 40 TRUE 69.9 1.72
## 6 20 FALSE 54.5 1.29
## 7 25 FALSE 63.5 1.46
## 8 30 FALSE 72.5 2.55
## 9 35 FALSE 81.4 3.84
## 10 40 FALSE 90.4 5.19
# plot predictions from the model with SE
new.predictions %>%
ggplot(aes(work_hours,
y=.fitted,
ymin= .fitted-.se.fit,
ymax= .fitted+.se.fit,
color=female, group=female)) +
geom_line() +
geom_errorbar(width=.1) +
geom_point()# geom_ribbon is another way of showing the error of estimates
new.predictions %>%
ggplot(aes(work_hours,
y=.fitted,
ymin= .fitted-.se.fit,
ymax= .fitted+.se.fit,
color=female, group=female)) +
geom_ribbon(alpha=.1, color=NA) +
geom_line()Clinical trial example
Data from a clinical trial of Functional Imagery Training (Solbrig et al. 2019, FIT) are available at https://zenodo.org/record/1120364/files/blind_data.csv. In this file, group represents the treatment group (FIT=2; Motivational Interviewing=1). The kg1 and kg3 variables represent the patients’ weights in kilograms before and after treatment, respectively. Load these data and complete the following tasks:
Plot the difference in weight between treatment groups at followup (
kg3)Create a plot to show whether men or women benefited most from the treatment (this will require some thinking about what goes on the y-axis, and perhaps some pre-processing of the data).
Create a plot to show whether older participants benefited more or less than younger ones (again this will require some thinking, and there are quite a number of different plot types which could be used, each with different pros and cons).
Run a linear model which is equivalent to the plot you created. Can you match the coefficients in the model output to the points and lines on your graph?
Preview: Multiple continuous predictors
In the previous example plotting our data was quite simple: We had an x and y axis, and used colour to show the third.
If we have two continuous predictors we can do the same, using color or size to indicate the third dimension. Here are two examples using some data on the income, murder and illiteracy rates of US states:
# the data called state.x77 are in an old fashioned format,
# so do this first to make them into a dataframe
states <- as_tibble(state.x77)states %>%
ggplot(aes(x = Murder, y = Income, color=Illiteracy)) + geom_point()3 dimensions using x, y and color
states %>%
ggplot(aes(x = Murder, y = Income, size=Illiteracy)) + geom_point()3 dimensions using x, y and point size
Another alternative is to categorize one of the variables and make a faceted plot:
states %>%
mutate(Illiteracy_categorised = cut(Illiteracy, breaks=3,
labels=c("Low illiteracy", "Medium illiteracy", "Highilliteracy"))) %>%
ggplot(aes(Murder, Income)) +
geom_point() +
facet_wrap(~Illiteracy_categorised) +
geom_smooth(method=lm, se=F)Explanation of the code In the code above, the cut command is used to split the Illiteracy variable into 3 categories (breaks=3) and give each category a label (labels = c("Low", ...)). We then use this categorical version to create a faceted plot for low, medium and high illiteracy values.
If you look at these plots it’s about possible to see that there is an interaction here: As Illiteracy increases, the relationship between Murder and Income changes (it starts positive but ends up slightly negative).
Try describing the plot above in words.
Try re-plotting the data but with different values on different axes (e.g. swap x, y, color, size etc).
Which combination feels most ‘natural’ to you; that is, which is easiest to describe?
3D plots
However, another way of displaying these data is to use a true 3D plot. An example is below:
Explanation of this plot: Rather than a standard x/y plot, this one adds a third dimension, z. This means that to see all the data you need to click and drag on the image to rotate the axes. It can take a bit of getting used to, but one tip is to make sure you keep track of which axis is which as you move the viewing angle.
Explanation of the code: ggplot doesn’t make 3D plots. Instead we use plotly, which is an alternative graphing library. We use the plot_ly function, and pass it the states data as the first argument. Then we define the x, y and z axes. Note that each variable name from the dataset is prefixed by a ~ (tilde) character. This is required by plot_ly; it’s not consistent with ggplot which is a shame.
Regression ‘surfaces’
In the main session we saw we can represent the regression coefficients in a 2D plots. For example:
states %>%
ggplot(aes(Income, Illiteracy)) + geom_point() + geom_smooth(method="lm", se=F)This regression line looks problematic. The residuals (gaps from point to line) are quite large.
But, if we try to add Murder as another predictor of Income we can’t represent the model as a simple line. Instead we need to think of the regression model as a 3d surface:
Play with the interactive plot above.
Is this 3d surface a good ‘fit’ to these data?
In future sessions, as we increase the number of variables in our models, it will be difficult to keep hold of simple graphical equivalents to our models because it’s hard to think in 5 or 6 or more dimensions.
It’s good to remember this type of plot though. As we add more variables the lm function is still trying to minimise the gaps between the predicted line or surface and the actual data.
Solbrig, Linda, Ben Whalley, David J Kavanagh, Jon May, Tracey Parkin, Ray Jones, and Jackie Andrade. 2019. “Functional Imagery Training Versus Motivational Interviewing for Weight Loss: A Randomised Controlled Trial of Brief Individual Interventions for Overweight and Obesity.” International Journal of Obesity 43 (4): 883. https://www.ncbi.nlm.nih.gov/pubmed/30185920.